
Writing Python C Extensions

The Three Paths to Native Speed

In 2020, a computer vision team at a medical imaging company profiled their Python image processing pipeline. One function - a custom 2D median filter - consumed 73% of total CPU time. The function was 40 lines of pure Python, called 1.2 million times per day on 512×512 images. Pure Python iteration over pixel arrays is inherently slow: every pixel access goes through the array's __getitem__ dispatch and costs reference count updates, a bounds check, and typically a Python object allocation for the returned value.

The team evaluated three options and measured the speedup of each on their specific workload:

Option       | Speedup | Complexity | Portability | When to use
─────────────┼─────────┼────────────┼─────────────┼────────────────────────────────
Cython       | 30–60x  | Low        | High        | NumPy-heavy code, typed Python
ctypes       | 20–50x  | Medium     | High        | Call existing shared libraries
Python C API | 40–80x  | High       | Medium      | Maximum control, new algorithms

The team chose the Python C API because their algorithm involved custom memory layouts that Cython's type system didn't handle cleanly. The resulting C extension ran their median filter 54x faster - 73% CPU time dropped to 2.1%.

This lesson starts with the C API in depth, then covers ctypes and cffi for calling existing libraries.

Python C API Overview

Every C extension starts with a single include:

#define PY_SSIZE_T_CLEAN /* Required for Py_ssize_t vs int in format strings */
#include <Python.h>

This header is typically at /usr/include/python3.12/Python.h (Linux) or inside the Xcode SDK on macOS. It pulls in:

  • Object type definitions (PyObject, PyLongObject, PyListObject, etc.)
  • Reference counting macros (Py_INCREF, Py_DECREF, Py_XDECREF)
  • Argument parsing (PyArg_ParseTuple, PyArg_ParseTupleAndKeywords)
  • Error handling (PyErr_SetString, PyErr_NoMemory)
  • Type creation and module initialization infrastructure
  • The GIL macros (Py_BEGIN_ALLOW_THREADS, Py_END_ALLOW_THREADS)
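
The header location varies by platform and installation, so it can help to ask the interpreter directly. The following is a minimal sketch using only the standard library's sysconfig module; whether Python.h is actually present in the reported directory depends on the development headers being installed (for example, the python3-dev package on Debian-based systems).

```python
import sysconfig

# Directory where this interpreter expects Python.h to live.
include_dir = sysconfig.get_paths()["include"]
print("Python.h directory:", include_dir)

# Suffix that a build appends to an extension module's name,
# e.g. ".cpython-312-x86_64-linux-gnu.so" on 64-bit Linux with CPython 3.12.
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")
print("extension suffix:", ext_suffix)
print("a module named fastmath would be built as:", "fastmath" + ext_suffix)
```

The same suffix explains the long file names that build tools produce for compiled modules.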

The key abstraction is PyObject * - every Python object in C is a pointer to a PyObject. Python's type system, reference counting, and garbage collection all flow through this pointer type.

Writing a C Extension from Scratch

We will build a module called fastmath with two functions: fast_sum (demonstrates basic argument parsing and return values) and dot_product (demonstrates working with Python sequences).

The C Source File: fastmathmodule.c

/*
* fastmathmodule.c - example Python C extension
*
* Build:
* python setup.py build_ext --inplace
*
* Usage:
* import fastmath
* fastmath.fast_sum([1, 2, 3, 4, 5]) # -> 15
* fastmath.dot_product([1,2,3], [4,5,6]) # -> 32
*/

#define PY_SSIZE_T_CLEAN
#include <Python.h>

/* ─── fast_sum ───────────────────────────────────────────────────────────── */

/*
 * fast_sum(numbers) -> int
 *
 * Sums a sequence of Python integers, accumulating in a C long long
 * rather than in intermediate Python objects.
 */
static PyObject *
fastmath_fast_sum(PyObject *self, PyObject *args)
{
    PyObject *sequence;
    long long total = 0;
    Py_ssize_t i, n;

    /* PyArg_ParseTuple: parse positional arguments.
     * "O" format: a single Python object (no conversion), stored in `sequence`.
     * Returns 0 on failure (sets an exception); nonzero on success.
     */
    if (!PyArg_ParseTuple(args, "O", &sequence)) {
        return NULL; /* exception already set by PyArg_ParseTuple */
    }

    /* Get the sequence length - works on list, tuple, range, etc. */
    n = PySequence_Length(sequence);
    if (n < 0) {
        /* PySequence_Length returns -1 and sets TypeError for non-sequences */
        return NULL;
    }

    for (i = 0; i < n; i++) {
        /* PySequence_GetItem returns a NEW REFERENCE - we own it and must DECREF */
        PyObject *item = PySequence_GetItem(sequence, i);
        if (item == NULL) {
            return NULL; /* IndexError set by GetItem */
        }

        /* Convert to C long long */
        long long val = PyLong_AsLongLong(item);

        /* DECREF: we're done with this item, so release our reference */
        Py_DECREF(item);

        /* PyLong_AsLongLong returns -1 on error (e.g., item is a float or string).
         * -1 is also a valid result, so PyErr_Occurred() disambiguates. */
        if (val == -1 && PyErr_Occurred()) {
            return NULL;
        }

        /* Note: the running total can overflow long long for very large inputs */
        total += val;
    }

    /* Build and return a Python integer from C long long */
    return PyLong_FromLongLong(total);
}

/* ─── dot_product ────────────────────────────────────────────────────────── */

/*
 * dot_product(a, b) -> float
 *
 * Computes the dot product of two equal-length sequences of numbers.
 */
static PyObject *
fastmath_dot_product(PyObject *self, PyObject *args)
{
    PyObject *seq_a, *seq_b;
    Py_ssize_t na, nb, i;
    double total = 0.0;

    /* "OO" format: two Python objects */
    if (!PyArg_ParseTuple(args, "OO", &seq_a, &seq_b)) {
        return NULL;
    }

    /* Check each length right away so no further API call runs
     * while an exception is already pending. */
    na = PySequence_Length(seq_a);
    if (na < 0) {
        return NULL;
    }
    nb = PySequence_Length(seq_b);
    if (nb < 0) {
        return NULL;
    }

    if (na != nb) {
        /* Set a custom exception */
        PyErr_SetString(PyExc_ValueError,
                        "dot_product: sequences must have the same length");
        return NULL;
    }

    for (i = 0; i < na; i++) {
        PyObject *a_item = PySequence_GetItem(seq_a, i);
        PyObject *b_item = PySequence_GetItem(seq_b, i);

        if (a_item == NULL || b_item == NULL) {
            /* Py_XDECREF: like Py_DECREF but safe for NULL pointers */
            Py_XDECREF(a_item);
            Py_XDECREF(b_item);
            return NULL;
        }

        /* PyFloat_AsDouble handles both int and float Python objects */
        double a_val = PyFloat_AsDouble(a_item);
        double b_val = PyFloat_AsDouble(b_item);

        Py_DECREF(a_item);
        Py_DECREF(b_item);

        /* -1.0 is a valid value, so also check the error indicator */
        if ((a_val == -1.0 || b_val == -1.0) && PyErr_Occurred()) {
            return NULL;
        }

        total += a_val * b_val;
    }

    /* Build and return a Python float from C double */
    return PyFloat_FromDouble(total);
}

/* ─── Method Table ────────────────────────────────────────────────────────── */

/*
 * The PyMethodDef array lists all functions exposed by this module.
 * Each entry: { "python_name", c_function, calling_convention, docstring }
 *
 * METH_VARARGS:  function receives *args as a tuple (PyArg_ParseTuple)
 * METH_KEYWORDS: also receives **kwargs (use PyArg_ParseTupleAndKeywords)
 * METH_NOARGS:   no arguments at all
 * METH_O:        exactly one argument (more efficient than METH_VARARGS for single arg)
 */
static PyMethodDef FastmathMethods[] = {
    {
        "fast_sum",
        fastmath_fast_sum,
        METH_VARARGS,
        "fast_sum(numbers) -> int\n\n"
        "Sum a sequence of integers. Significantly faster than sum() for large lists."
    },
    {
        "dot_product",
        fastmath_dot_product,
        METH_VARARGS,
        "dot_product(a, b) -> float\n\n"
        "Compute the dot product of two equal-length sequences."
    },
    {NULL, NULL, 0, NULL} /* Sentinel: marks end of the method table */
};

/* ─── Module Definition ───────────────────────────────────────────────────── */

static struct PyModuleDef fastmathmodule = {
    PyModuleDef_HEAD_INIT,  /* always this value */
    "fastmath",             /* module name */
    /* module docstring - NULL if none */
    "Fast mathematical operations implemented in C.",
    -1,                     /* size of per-module state; -1 = state lives in C globals,
                               so the module does not support sub-interpreters */
    FastmathMethods,        /* method table defined above */
};

/* ─── Module Init Function ────────────────────────────────────────────────── */

/*
 * PyMODINIT_FUNC: declares return type (PyObject*) and calling convention.
 * The function name MUST be PyInit_<module_name>.
 * Python calls this when the module is first imported.
 */
PyMODINIT_FUNC
PyInit_fastmath(void)
{
    /* Create the module object */
    PyObject *module = PyModule_Create(&fastmathmodule);
    if (module == NULL) {
        return NULL;
    }

    /* Add module-level constants */
    if (PyModule_AddIntConstant(module, "VERSION_MAJOR", 1) < 0 ||
        PyModule_AddIntConstant(module, "VERSION_MINOR", 0) < 0 ||
        PyModule_AddStringConstant(module, "AUTHOR", "EngineersOfAI") < 0) {
        Py_DECREF(module);
        return NULL;
    }

    return module;
}
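
From the Python side, the pieces defined in the C file (method table entries, docstrings, the module name) surface as ordinary attributes. The sketch below uses the standard library's math module, which is itself implemented in C, as a stand-in for fastmath; that substitution is an assumption made so the example runs without building anything.

```python
import inspect
import math  # C-implemented stdlib module, used here as a stand-in for fastmath

# Functions registered through a PyMethodDef table appear as built-in functions.
func_type = type(math.sqrt).__name__
print(func_type)                       # builtin_function_or_method
print(inspect.isbuiltin(math.sqrt))    # True

# The docstring field of the PyMethodDef entry becomes __doc__.
print(math.sqrt.__doc__)

# The module name from PyModuleDef becomes __name__.
print(math.__name__)                   # math

# C functions carry no Python bytecode, so there is no __code__ attribute.
has_code = hasattr(math.sqrt, "__code__")
print(has_code)                        # False
```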

The setup.py Build Script

# setup.py
from setuptools import setup, Extension
import sys

# Conditional compilation flags based on platform.
# Caution: -march=native ties the binary to the build machine's CPU, and
# -ffast-math relaxes IEEE 754 semantics; drop both for distributable wheels.
extra_compile_args = ["-O3", "-march=native"]
if sys.platform == "linux":
    extra_compile_args.append("-ffast-math")

fastmath_module = Extension(
    "fastmath",                    # module name (must match PyInit_ function)
    sources=["fastmathmodule.c"],  # C source files
    extra_compile_args=extra_compile_args,
    # Include directories beyond the default (Python.h location auto-detected)
    # include_dirs=["/usr/local/include"],
    # Libraries to link against
    # libraries=["m"],  # link libm for math functions
    # library_dirs=["/usr/local/lib"],
)

setup(
    name="fastmath",
    version="1.0",
    description="Fast math operations in C",
    ext_modules=[fastmath_module],
)

# Shell: build in-place (.so file next to setup.py)
python setup.py build_ext --inplace

# The output is:
# fastmath.cpython-312-x86_64-linux-gnu.so (Linux)
# fastmath.cpython-312-darwin.so (macOS)

# Test it
python -c "
import fastmath
print(fastmath.fast_sum([1, 2, 3, 4, 5])) # 15
print(fastmath.dot_product([1, 2, 3], [4, 5, 6])) # 32.0
print(fastmath.VERSION_MAJOR, fastmath.AUTHOR) # 1 EngineersOfAI
help(fastmath.fast_sum)
"

Error Handling in C Extensions

Every C extension function that can fail must return NULL and set an exception. Python's exception system is thread-local: one pending exception per thread at a time.

/* Setting built-in exceptions */
PyErr_SetString(PyExc_ValueError, "value must be positive");
PyErr_SetString(PyExc_TypeError, "argument must be a string");
PyErr_SetString(PyExc_RuntimeError, "internal error occurred");
PyErr_SetString(PyExc_IndexError, "list index out of range");
PyErr_SetString(PyExc_KeyError, "key not found");

/* Setting exceptions with formatted messages */
PyErr_Format(PyExc_ValueError,
             "expected positive integer, got %ld", (long)value);

/* Memory allocation failure */
void *ptr = PyMem_Malloc(1024);
if (ptr == NULL) {
    PyErr_NoMemory(); /* sets MemoryError */
    return NULL;
}

/* Check if an exception is already set (without clearing it) */
if (PyErr_Occurred()) {
    /* exception is pending; propagate it */
    return NULL;
}

/* Clear a pending exception (rarely needed; usually propagate) */
PyErr_Clear();

/* Save and restore exception state (for cleanup code).
 * PyErr_Fetch/PyErr_Restore are deprecated since Python 3.12 in favor of
 * PyErr_GetRaisedException/PyErr_SetRaisedException. */
PyObject *exc_type, *exc_value, *exc_tb;
PyErr_Fetch(&exc_type, &exc_value, &exc_tb); /* save */
/* ... do cleanup that might also fail ... */
PyErr_Restore(exc_type, exc_value, exc_tb); /* restore */
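
The "return a sentinel and set an exception" convention can be observed from Python without compiling anything. The sketch below calls the real C API function PyLong_AsLongLong through ctypes.pythonapi; because pythonapi is a PyDLL, ctypes checks the error indicator after each call and raises whatever exception is pending. It is meant only as an illustration of the convention, not as a recommended way to convert integers.

```python
import ctypes

# Bind the C API function and declare its signature.
as_long_long = ctypes.pythonapi.PyLong_AsLongLong
as_long_long.argtypes = [ctypes.py_object]
as_long_long.restype = ctypes.c_longlong

print(as_long_long(42))   # 42
print(as_long_long(-1))   # -1: a legitimate result, no exception pending

# For these inputs the C function returns -1 AND sets an exception,
# which ctypes re-raises on the Python side.
errors = []
for bad in ("abc", 2**70):
    try:
        as_long_long(bad)
    except (TypeError, OverflowError) as exc:
        errors.append(type(exc).__name__)
        print(type(exc).__name__, "-", exc)
```

This is why C code that receives -1 from such a function must also call PyErr_Occurred(): the return value alone cannot distinguish the number -1 from a failure.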

Error Handling Pattern: Centralized Cleanup

static PyObject *
example_with_cleanup(PyObject *self, PyObject *args)
{
    PyObject *input = NULL;
    PyObject *result = NULL;
    PyObject *item = NULL;
    char *buf = NULL;

    if (!PyArg_ParseTuple(args, "O", &input)) {
        goto error; /* exception already set */
    }

    buf = (char *)PyMem_Malloc(4096);
    if (buf == NULL) {
        PyErr_NoMemory();
        goto error;
    }

    result = PyList_New(0);
    if (result == NULL) {
        goto error; /* MemoryError already set */
    }

    /* ... do work, building result ... */
    /* PyList_Append does NOT steal a reference: keep the new object in
     * `item` so it can be released on every path (passing
     * PyLong_FromLong(42) inline would leak it). */
    item = PyLong_FromLong(42);
    if (item == NULL || PyList_Append(result, item) < 0) {
        goto error;
    }
    Py_CLEAR(item); /* the list now holds its own reference */

    PyMem_Free(buf);
    return result;

error:
    PyMem_Free(buf);    /* PyMem_Free(NULL) is safe */
    Py_XDECREF(item);   /* Py_XDECREF(NULL) is safe */
    Py_XDECREF(result);
    return NULL;
}

Reference Counting in C

Python's memory management is based on reference counting. Every PyObject has an ob_refcnt field. When it reaches 0, the object is deallocated. Getting reference counting wrong in C extensions causes:

  • Memory leaks (refcount never reaches 0)
  • Use-after-free / segfaults (refcount reaches 0 too early)
  • Corrupted objects (double-free)

The Rules

Rule 1: You own the reference you create.
  PyLong_FromLong(42) returns a new reference. You own it.
  You must eventually DECREF it.

Rule 2: You own references returned by functions documented as returning NEW references.
  PySequence_GetItem() → new reference. You own it.
  PyObject_GetItem() → new reference. You own it.
  PyDict_GetItem() → BORROWED reference. You do NOT own it.
  PyDict_GetItemWithError() → BORROWED reference as well, despite the longer name.

Rule 3: Borrowed references do NOT require DECREF.
  They are valid only as long as the owner (usually a container) keeps its reference.
  If you need to keep a borrowed reference, INCREF it to own it.

Rule 4: Functions that "steal" a reference take ownership from you.
  You must NOT DECREF after passing a stolen reference.
  PyList_SET_ITEM(list, i, item) → steals reference to item
  PyTuple_SET_ITEM(tuple, i, item) → steals reference to item
  PyModule_AddObject(module, name, obj) → steals reference, but only on success
    (PyModule_AddObjectRef, added in 3.10, never steals)

/* ─── New reference: you own it, must DECREF ─── */
/* Counts below are conceptual: CPython caches small ints such as 42,
   so the real count is higher and the object is never actually freed. */
PyObject *n = PyLong_FromLong(42); /* refcount = 1, you own it */
Py_INCREF(n);                      /* refcount = 2 */
Py_DECREF(n);                      /* refcount = 1 */
Py_DECREF(n);                      /* refcount = 0 → object freed */
/* Do NOT use n after this point */

/* ─── Py_XDECREF: safe for NULL ─── */
PyObject *maybe_null = NULL;
Py_XDECREF(maybe_null); /* no-op, does not crash */

/* ─── Py_RETURN_NONE: increment None's refcount and return it ─── */
/* NEVER just "return Py_None" - that would decrement None's refcount
   when the caller DECREFs the return value, eventually freeing None! */
static PyObject *
returns_none(PyObject *self, PyObject *args)
{
    Py_RETURN_NONE; /* equivalent to: Py_INCREF(Py_None); return Py_None; */
}

/* ─── Borrowed reference: do NOT DECREF ─── */
PyObject *dict = PyDict_New();              /* new reference */
PyObject *key = PyUnicode_FromString("x");  /* new reference */
PyObject *val = PyLong_FromLong(1);         /* new reference */
PyDict_SetItem(dict, key, val);             /* does NOT steal key or val */
Py_DECREF(val);                             /* release our reference; dict holds its own */

PyObject *item = PyDict_GetItem(dict, key); /* BORROWED - dict owns it */
/* Use item here - do NOT Py_DECREF(item) */
/* item becomes invalid if dict is modified or freed */
Py_DECREF(key);  /* done with key: release our reference after the lookup */
Py_DECREF(dict); /* frees dict and all its contents (if no other refs) */

/* ─── Stealing references ─── */
PyObject *list = PyList_New(3);                /* new list with 3 NULL slots */
PyList_SET_ITEM(list, 0, PyLong_FromLong(10)); /* SET_ITEM steals reference */
PyList_SET_ITEM(list, 1, PyLong_FromLong(20)); /* DO NOT Py_DECREF these */
PyList_SET_ITEM(list, 2, PyLong_FromLong(30)); /* list now owns them */
/* PyList_Append does NOT steal - it creates an internal reference */
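
The "containers take their own reference" behavior can be watched from Python with sys.getrefcount. This is a rough sketch for CPython: absolute counts differ between versions (and include the temporary reference created by the getrefcount call itself), so only the differences between readings are meaningful.

```python
import sys

item = object()                  # a fresh object referenced only by `item`
base = sys.getrefcount(item)     # baseline reading

container = []
container.append(item)           # list.append takes its own reference (no stealing)
after_append = sys.getrefcount(item)

mapping = {"x": item}            # the dict also holds its own reference to the value
after_dict = sys.getrefcount(item)

container.clear()                # the list releases its reference
del mapping                      # the dict is freed and releases its reference
final = sys.getrefcount(item)

print(after_append - base, after_dict - base, final - base)
```

In C terms, each container performed the equivalent of a Py_INCREF when it stored the object and a Py_DECREF when it let go, which is why the caller of PyList_Append or PyDict_SetItem still has to release its own reference.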

Releasing the GIL for CPU-Bound C Code

The GIL prevents multiple Python threads from executing Python bytecode simultaneously. In a C extension, once you are in pure C code with no Python objects, you can release the GIL to allow other Python threads to run concurrently.

/*
 * CPU-intensive function that releases the GIL.
 * Other Python threads can run while this executes.
 */
static PyObject *
fastmath_parallel_sum(PyObject *self, PyObject *args)
{
    Py_buffer view;
    double result;

    /* Accept a buffer (memoryview, bytes, bytearray, numpy array) */
    if (!PyArg_ParseTuple(args, "y*", &view)) {
        return NULL;
    }

    /* Extract a pointer to the raw data BEFORE releasing the GIL */
    const double *data = (const double *)view.buf;
    Py_ssize_t n = view.len / sizeof(double);

    /* Release the GIL: other Python threads can run now.
     * After this point, do NOT touch any Python objects.
     * Do NOT call any PyXxx functions (except from a re-acquired GIL).
     */
    Py_BEGIN_ALLOW_THREADS

    result = 0.0;
    for (Py_ssize_t i = 0; i < n; i++) {
        result += data[i];
    }
    /* Could also: spawn POSIX threads here, do I/O, call blocking C libraries */

    /* Re-acquire the GIL before returning to Python */
    Py_END_ALLOW_THREADS

    /* Now safe to create Python objects again */
    PyBuffer_Release(&view);
    return PyFloat_FromDouble(result);
}

The Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macros expand to:

/* BEGIN_ALLOW_THREADS */
PyThreadState *_save = PyEval_SaveThread(); /* releases GIL */

/* ... C code without Python objects ... */

/* END_ALLOW_THREADS */
PyEval_RestoreThread(_save); /* re-acquires GIL */
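
The effect of releasing the GIL is visible from Python with any C function that does so. The sketch below uses time.sleep, which is implemented in C and releases the GIL while it waits; it stands in for a long-running C computation wrapped in Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS. The sleep length and thread count are arbitrary choices for illustration.

```python
import threading
import time

def blocking_call():
    # time.sleep releases the GIL while waiting, so several threads
    # can be inside it at the same time.
    time.sleep(0.2)

threads = [threading.Thread(target=blocking_call) for _ in range(4)]

start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# With the GIL released, total time stays close to one sleep (about 0.2 s)
# instead of the 0.8 s that four serialized sleeps would take.
print(f"4 threads x 0.2 s sleep took {elapsed:.2f} s")
```

A C function that held the GIL for the same 0.2 seconds would force the four calls to run one after another.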

Accepting NumPy Arrays with the Buffer Protocol

Python's buffer protocol allows C extensions to accept any object that exposes a contiguous memory buffer - NumPy arrays, bytearray, bytes, memoryview. You don't need to import NumPy in your C code.

#define PY_SSIZE_T_CLEAN
#include <Python.h>

/*
 * numpy_sum(array) -> float
 *
 * Accepts any buffer object whose items are C doubles
 * (float64 numpy array, array.array('d'), memoryview cast to 'd').
 */
static PyObject *
fastmath_numpy_sum(PyObject *self, PyObject *args)
{
    PyObject *obj;
    Py_buffer view;
    double total = 0.0;

    if (!PyArg_ParseTuple(args, "O", &obj)) {
        return NULL;
    }

    /* Request a C-contiguous buffer together with its format string.
     * The "y*" format code of PyArg_ParseTuple is not enough here: it
     * requests PyBUF_SIMPLE, which leaves view.format set to NULL, so the
     * item type could not be validated.
     *
     * For a C-contiguous numpy array of float64:
     *   view.buf      = pointer to first element
     *   view.len      = n_elements * 8
     *   view.format   = "d" (double)
     *   view.itemsize = 8
     */
    if (PyObject_GetBuffer(obj, &view, PyBUF_C_CONTIGUOUS | PyBUF_FORMAT) < 0) {
        return NULL; /* exception already set */
    }

    /* Validate: must be float64 */
    if (view.format == NULL || view.format[0] != 'd' ||
        view.itemsize != (Py_ssize_t)sizeof(double)) {
        PyBuffer_Release(&view);
        PyErr_SetString(PyExc_TypeError,
                        "numpy_sum expects a float64 array (format 'd')");
        return NULL;
    }

    Py_ssize_t n = view.len / view.itemsize;
    const double *data = (const double *)view.buf;

    /* Release GIL for the computation */
    Py_BEGIN_ALLOW_THREADS
    for (Py_ssize_t i = 0; i < n; i++) {
        total += data[i];
    }
    Py_END_ALLOW_THREADS

    /* Always release the buffer - this tells the exporter we are done
     * and drops the reference held in view.obj */
    PyBuffer_Release(&view);

    return PyFloat_FromDouble(total);
}

Python usage:

import numpy as np
import fastmath

arr = np.arange(1_000_000, dtype=np.float64)
result = fastmath.numpy_sum(arr)
print(f"Sum: {result:.0f}") # Sum: 499999500000

# Also works with memoryview and bytearray
import struct
buf = bytearray(struct.pack("d" * 5, 1.0, 2.0, 3.0, 4.0, 5.0))
# A bytearray exports format 'B' (bytes), so cast a memoryview to 'd' first
print(fastmath.numpy_sum(memoryview(buf).cast("d"))) # 15.0
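
The fields a C extension reads from Py_buffer are also visible on a Python memoryview, which makes it easy to check what an object exports before passing it to C. A small standard-library sketch (no NumPy or fastmath needed):

```python
import array
import struct

# array.array('d') exports a buffer of C doubles.
values = array.array("d", [1.0, 2.0, 3.0])
view = memoryview(values)
print(view.format, view.itemsize, view.nbytes, view.shape, view.c_contiguous)
# d 8 24 (3,) True

# A bytearray exports unsigned bytes (format 'B'); cast() reinterprets them as doubles.
raw = bytearray(struct.pack("5d", 1.0, 2.0, 3.0, 4.0, 5.0))
as_doubles = memoryview(raw).cast("d")
print(as_doubles.format, len(as_doubles), sum(as_doubles))
# d 5 15.0

# The view shares memory with its exporter: no copy is made.
as_doubles[0] = 10.0
first = struct.unpack_from("d", raw, 0)[0]
print(first)  # 10.0
```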

ctypes: Calling Any Shared Library

ctypes loads compiled shared libraries and calls their functions from Python without writing any C glue code. It is usually the quickest route, in development time, to using an existing C library (or a C++ library that exposes extern "C" functions).

import ctypes
import ctypes.util

# ─── Loading a library ───────────────────────────────────────────────────────

# Load the C standard library
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Load a specific .so or .dylib file
# libm = ctypes.CDLL("/usr/lib/x86_64-linux-gnu/libm.so.6")
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# ─── Calling a simple function ───────────────────────────────────────────────

# sqrt: declare argument and return types for type safety
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

result = libm.sqrt(2.0)
print(f"sqrt(2.0) = {result:.6f}") # 1.414214

# ─── ctypes primitive types ──────────────────────────────────────────────────
# c_bool, c_byte, c_ubyte
# c_short, c_ushort
# c_int, c_uint, c_long, c_ulong, c_longlong, c_ulonglong
# c_float, c_double, c_longdouble
# c_char, c_wchar, c_char_p (null-terminated string), c_wchar_p
# c_void_p (void pointer)

# ─── String handling ─────────────────────────────────────────────────────────

libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

s = b"hello, world"
length = libc.strlen(s)
print(f"strlen('{s.decode()}') = {length}")

# ─── Pointers ────────────────────────────────────────────────────────────────

# c_int pointer
val = ctypes.c_int(42)
ptr = ctypes.pointer(val)
print(f"Value via pointer: {ptr.contents.value}")

# Modify through pointer
ptr.contents.value = 100
print(f"Modified value: {val.value}")

# ─── Structures ──────────────────────────────────────────────────────────────

class Point(ctypes.Structure):
    _fields_ = [
        ("x", ctypes.c_double),
        ("y", ctypes.c_double),
    ]

    def __repr__(self):
        return f"Point({self.x}, {self.y})"

p = Point(3.0, 4.0)
print(f"Point: {p}")
print(f"Size: {ctypes.sizeof(Point)} bytes") # 16 bytes

# Array of structures
Points = Point * 3
arr = Points(Point(0, 0), Point(1, 2), Point(3, 4))
for i, pt in enumerate(arr):
    print(f" arr[{i}] = {pt}")

# ─── Calling a custom shared library ─────────────────────────────────────────

# Assume we have compiled: gcc -shared -fPIC -O2 -o libvec.so vec.c
# with a function: double vec_dot(double *a, double *b, int n)

def call_vec_dot_example():
    """
    Example of calling a custom shared library's dot product function.
    This is illustrative - requires an actual libvec.so to run.
    """
    import array

    try:
        lib = ctypes.CDLL("./libvec.so")
    except OSError:
        print("libvec.so not found - compile it first")
        return

    lib.vec_dot.argtypes = [
        ctypes.POINTER(ctypes.c_double),
        ctypes.POINTER(ctypes.c_double),
        ctypes.c_int,
    ]
    lib.vec_dot.restype = ctypes.c_double

    a = array.array("d", [1.0, 2.0, 3.0])
    b = array.array("d", [4.0, 5.0, 6.0])

    # Wrap the underlying buffers as ctypes arrays (no copy); ctypes passes
    # an array where a POINTER(c_double) argument is expected
    a_ptr = (ctypes.c_double * len(a)).from_buffer(a)
    b_ptr = (ctypes.c_double * len(b)).from_buffer(b)

    result = lib.vec_dot(a_ptr, b_ptr, len(a))
    print(f"vec_dot([1,2,3], [4,5,6]) = {result}") # 32.0
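
Declaring argtypes is what turns a mistyped call into a Python exception instead of undefined behavior inside the C function. The sketch below illustrates this with sqrt; the library lookup (libm, then libc, then symbols already loaded into the process) is an assumption that holds on typical Linux and macOS systems and would need adapting on Windows.

```python
import ctypes
import ctypes.util

# Locate the math library; if nothing is found, CDLL(None) falls back to
# the symbols already loaded into the current process (POSIX only).
path = ctypes.util.find_library("m") or ctypes.util.find_library("c")
libm = ctypes.CDLL(path)

libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

root = libm.sqrt(9)      # the Python int is converted to a C double
print(root)              # 3.0

rejected = False
try:
    libm.sqrt("nine")    # wrong type: rejected before the C function runs
except ctypes.ArgumentError as exc:
    rejected = True
    print("rejected:", exc)
```

Without restype, ctypes assumes the function returns a C int, so a function returning double would produce meaningless values; setting both attributes for every function you call is the safer habit.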

cffi: A Cleaner Alternative to ctypes

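CFFI (C Foreign Function Interface) lets you paste C function declarations directly from header files. It is often preferred over ctypes for new code that calls external C libraries, because the declarations stay in C syntax and can be checked against the real headers in API mode.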

from cffi import FFI

ffi = FFI()

# ─── ABI mode: load existing shared library by its ABI ──────────────────────
# Just paste the C function declarations you want to call

ffi.cdef("""
/* From <math.h> */
double sqrt(double x);
double pow(double x, double y);
double log(double x);

/* From <string.h> */
size_t strlen(const char *s);
char *strcpy(char *dst, const char *src);

/* From <stdlib.h> */
void *malloc(size_t size);
void free(void *ptr);
int abs(int j);
""")

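# Load the C standard library
# (dlopen(None) opens the process's default C namespace on POSIX systems;
# on Windows, pass an explicit library name instead)
C = ffi.dlopen(None)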

print(C.sqrt(2.0)) # 1.4142135623730951
print(C.pow(2.0, 10.0)) # 1024.0
print(C.abs(-42)) # 42
print(C.strlen(b"hello")) # 5

# Allocate memory through malloc
buf = C.malloc(256)
if buf == ffi.NULL:
    raise MemoryError("malloc failed")
chars = ffi.cast("char *", buf)
chars[0:5] = b"hello"
# ffi.buffer gives a bytes-like view of raw memory; [:] copies it into bytes
print(ffi.buffer(chars, 5)[:]) # b'hello'
C.free(buf)

# ─── API mode: compile a C extension at install time ────────────────────────
# API mode generates C code, compiles it, and loads the resulting .so
# This is faster than ABI mode and allows inline C

ffi_api = FFI()
ffi_api.cdef("""
int add(int a, int b);
double fast_exp(double x);
""")

# The C source for our functions
ffi_api.set_source("_mymath_cffi",
"""
#include <math.h>

int add(int a, int b) {
return a + b;
}

double fast_exp(double x) {
/* Fast approximation using Taylor series */
return 1.0 + x + (x*x)/2.0 + (x*x*x)/6.0;
}
""",
libraries=["m"],
)

# To compile: run ffi_api.compile() or install via setup.py
# ffi_api.compile(verbose=True)
# from _mymath_cffi import ffi, lib
# print(lib.add(3, 4)) # 7
# print(lib.fast_exp(0.5)) # ~1.6458 (true exp(0.5) is ~1.6487)
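
Because fast_exp truncates the Taylor series after the cubic term, it is only accurate near zero. The pure-Python mirror below (written for illustration; it is not part of the cffi module) compares it with math.exp to show how quickly the error grows.

```python
import math

def fast_exp(x: float) -> float:
    """Python mirror of the C fast_exp: Taylor series truncated after x**3/6."""
    return 1.0 + x + (x * x) / 2.0 + (x * x * x) / 6.0

rel_errors = {}
for x in (0.1, 0.5, 1.0, 3.0):
    approx = fast_exp(x)
    exact = math.exp(x)
    rel_errors[x] = abs(approx - exact) / exact
    print(f"x={x}: approx={approx:.4f} exact={exact:.4f} rel_err={rel_errors[x]:.2%}")
```

At x = 0.5 the approximation is off by well under one percent, while at x = 3.0 it returns 13.0 against a true value of about 20.09, so a function like this needs a documented input range.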

Building with setuptools vs meson-python

setuptools (traditional, widely supported)

# setup.py
from setuptools import setup, Extension
import numpy as np

ext = Extension(
"fastmath",
sources=["fastmathmodule.c"],
include_dirs=[np.get_include()], # add NumPy headers
extra_compile_args=["-O3", "-march=native", "-ffast-math"],
extra_link_args=[],
define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
)

setup(
    name="fastmath",
    ext_modules=[ext],
)

# pyproject.toml (modern setuptools; declares the PEP 517 build backend)
[build-system]
requires = ["setuptools>=61", "numpy"]
build-backend = "setuptools.build_meta"

[project]
name = "fastmath"
version = "1.0.0"

# Shell: build and install in development mode
pip install -e . --no-build-isolation

# Shell: build a wheel
python -m build --wheel

meson-python (modern, used by NumPy, SciPy)

# meson.build
project('fastmath', 'c',
version : '1.0',
default_options : ['c_std=c11', 'optimization=3'])

py = import('python').find_installation()

py.extension_module(
'fastmath',
'fastmathmodule.c',
install : true,
include_directories : py.get_path('platinclude'),
)

# pyproject.toml for meson-python
[build-system]
requires = ["meson-python", "ninja"]
build-backend = "mesonpy"

Complete Working Example: Benchmark

Let us benchmark our fastmath.fast_sum against Python's built-in sum and NumPy:

import time
import random

# Assume fastmath.so is built
try:
    import fastmath
    HAS_FASTMATH = True
except ImportError:
    HAS_FASTMATH = False
    print("fastmath not built - run: python setup.py build_ext --inplace")

import numpy as np

N = 1_000_000
data = [random.randint(0, 1000) for _ in range(N)]
arr = np.array(data, dtype=np.int64)

def bench(label, func, *args):
    # Warm up
    for _ in range(3):
        func(*args)
    # Measure
    t0 = time.perf_counter()
    for _ in range(100):
        result = func(*args)
    elapsed = (time.perf_counter() - t0) / 100
    print(f" {label:<25} {elapsed*1000:.2f}ms result={result}")

print(f"\nBenchmark: sum of {N:,} integers (100 iterations each)")
print("─" * 60)
bench("Python sum(list)", sum, data)
bench("numpy.sum(array)", np.sum, arr)
if HAS_FASTMATH:
    bench("fastmath.fast_sum(list)", fastmath.fast_sum, data)

# Results reported for this lesson (yours will differ by machine and Python version):
# Python sum(list) 8.20ms result=500089441
# numpy.sum(array) 0.42ms result=500089441
# fastmath.fast_sum(list) 3.10ms result=500089441
#
# In this run fastmath beat sum() and lost to numpy, which operates on a
# C contiguous array while fast_sum walks a list of Python objects.
# Built-in sum() is itself implemented in C with a fast path for ints, so the
# gap between it and fast_sum depends on the interpreter and may be small.
# The real win for fastmath comes when implementing algorithms numpy doesn't have

Interview Q&A

Q1: What is a reference count leak in a C extension and how does it manifest?

A reference count leak occurs when a C extension increments a PyObject's reference count (via Py_INCREF or by receiving a new reference from a C API call) but never decrements it (via Py_DECREF). The object's refcount never reaches zero, so Python's allocator never frees it. The object accumulates in memory for the lifetime of the process.

Leaks in C extensions are particularly insidious because they: (1) appear as steady memory growth with no obvious Python-level cause, (2) are not caught by Python's cyclic garbage collector (which only handles reference cycles, not leaked positive counts), and (3) are difficult to trace with standard profiling tools. The common causes are: forgetting Py_DECREF(item) after PySequence_GetItem(seq, i) in a loop; using PyList_Append() (which internally INCREFs) and then also keeping your own reference; building error-handling paths that skip the DECREF on early return. The fix is to use tools like tracemalloc, Valgrind with the Python suppression file, or refleaks test mode (python -m test -R 3:3 test_mymodule). Code review: every code path in every C function must either DECREF or transfer ownership of every new reference it holds.
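
A leak of this kind can be simulated from Python to see how it behaves. The sketch below uses ctypes.pythonapi to call the real C API functions Py_IncRef and Py_DecRef on a throwaway object; the Payload class is a hypothetical stand-in for whatever a buggy extension would be handling, and a weak reference serves as a probe for whether the object is still alive.

```python
import ctypes
import gc
import weakref

inc_ref = ctypes.pythonapi.Py_IncRef
inc_ref.argtypes = [ctypes.py_object]
inc_ref.restype = None
dec_ref = ctypes.pythonapi.Py_DecRef
dec_ref.argtypes = [ctypes.py_object]
dec_ref.restype = None

class Payload:
    """Hypothetical object handled by a buggy extension."""

obj = Payload()
probe = weakref.ref(obj)   # lets us check whether the object is still alive

inc_ref(obj)               # simulate a C function that forgot its Py_DECREF
del obj                    # drop the only Python-level reference
gc.collect()               # the cyclic GC cannot help: there is no cycle
leaked = probe() is not None
print("still alive after del:", leaked)        # True: the stray count keeps it alive

survivor = probe()         # take a real reference so the count can be repaired
dec_ref(survivor)          # the missing Py_DECREF
del survivor
freed = probe() is None
print("freed after matching DECREF:", freed)   # True
```

In a real extension there is no probe to consult, which is why the leak only shows up as memory growth over time.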

Q2: Explain the difference between borrowed and owned references in the Python C API. Give examples of API functions that return each.

An owned reference (new reference) means the caller is responsible for eventually calling Py_DECREF. The API "gave" you the object; you now own a reference count increment. You must DECREF it when done, or transfer ownership (e.g., via PyList_SET_ITEM which steals the reference).

A borrowed reference means the object's refcount was not incremented for you. The reference is valid as long as the container holding it is alive, but you do not own it. You must not call Py_DECREF on a borrowed reference. If you need to store a borrowed reference beyond the container's guaranteed lifetime, call Py_INCREF to convert it to an owned reference.

Functions returning new references (you own, must DECREF): PyLong_FromLong, PyFloat_FromDouble, PyUnicode_FromString, PyList_New, PyDict_New, PySequence_GetItem, PyObject_GetAttrString, PyObject_Call, PyImport_ImportModule.

Functions returning borrowed references (do NOT DECREF): PyList_GetItem, PyTuple_GetItem, PyDict_GetItem, PySequence_Fast_GET_ITEM, PySys_GetObject. Note the asymmetry: PyList_GetItem is borrowed, but PySequence_GetItem is new. Getting this wrong is the most common source of C extension bugs.

Q3: When and how do you release the GIL in a C extension, and what operations are forbidden while the GIL is released?

You release the GIL when your C code will perform a long-running operation that does not access Python objects: CPU-intensive computation (matrix multiply, hash computation, compression), blocking I/O (file read, network recv), or calls to thread-safe C libraries. Releasing the GIL allows other Python threads to run concurrently during your operation, improving throughput in multithreaded Python programs.

The mechanism is: Py_BEGIN_ALLOW_THREADS (calls PyEval_SaveThread(), releases the GIL, saves the thread state) and Py_END_ALLOW_THREADS (calls PyEval_RestoreThread(), re-acquires the GIL). These macros must be balanced - every BEGIN must have a corresponding END, including in all error paths. Between them, these operations are forbidden: calling any PyXxx function, accessing any PyObject *, creating Python objects, raising Python exceptions (PyErr_SetString), acquiring Python locks. These operations are permitted: C standard library functions, POSIX syscalls, arithmetic on C types, calling GIL-free C libraries. If you need to signal an error from within the no-GIL section, set a C-level flag and check it after END_ALLOW_THREADS, then call PyErr_SetString.

Q4: What is the buffer protocol and why is it more general than accepting only bytes or numpy.ndarray?

The buffer protocol is a C-level interface declared by Python.h (the Py_buffer struct and PyObject_GetBuffer; bufferobject.h belonged to the old Python 2 buffer object). Any object whose type implements bf_getbuffer (exposed to Python code as __buffer__ since Python 3.12) exposes a contiguous (or strided) memory region to C code without copying. The protocol conveys: a pointer to the raw memory (buf), total length in bytes (len), item size (itemsize), a format string describing the type (e.g., "d" for double, "B" for unsigned byte), shape (for multidimensional arrays), and strides.

In C, use PyArg_ParseTuple(args, "y*", &view) to accept any object that exports a C-contiguous buffer, or PyObject_GetBuffer(obj, &view, flags) when you also need the format string, shape, or strides. This handles: bytes, bytearray, memoryview, NumPy arrays of any dtype, array.array, and any third-party type implementing the protocol. Without the buffer protocol, you would need to special-case each type: if (PyBytes_Check(obj)) ... else if (PyByteArray_Check(obj)) ..., with no portable check at all for third-party types such as ndarray. The buffer protocol provides a single, zero-copy interface for all of them. The Py_buffer struct, filled in according to request flags such as PyBUF_SIMPLE or PyBUF_FORMAT, gives you everything needed to operate on the raw memory (note that format is only filled in when PyBUF_FORMAT is requested), and PyBuffer_Release(&view) properly releases the buffer when you are done.

Q5: Compare ctypes and cffi - when would you choose each, and what are the performance tradeoffs?

Both ctypes and cffi allow calling compiled C code from Python without writing C glue code. The differences are in API ergonomics, performance, and use case fit.

ctypes is in the standard library (no install required), uses Python classes to describe C types (c_int, c_double, Structure), and loads .so/.dll files at runtime. The overhead per call is significant - ctypes marshals Python objects to C arguments and back using Python's type system on every call, adding ~1–5 μs overhead per call. It is best for: calling an existing system library occasionally (one-shot operations, low call frequency), system administration scripts, and environments where installing packages is restricted.

cffi (requires pip install cffi) operates in two modes. ABI mode (ffi.dlopen()) is similar to ctypes but uses C-declaration syntax directly - you paste function prototypes from header files, which is less error-prone than manually constructing ctypes types. ABI mode has similar per-call overhead to ctypes. API mode (ffi.set_source() + compile) generates a C wrapper at build time and compiles it into a .so. At runtime, function calls go through the compiled C wrapper with minimal Python overhead - typically 100–500 ns per call, 5–10x faster than ctypes or cffi ABI mode. cffi API mode is the right choice for: performance-critical hot-path code, libraries you control (you can adjust their API), and new projects where you are willing to add a build step. The Python C API extension remains the fastest option (~50–100 ns overhead) and gives the most control over memory layout and GIL management, at the cost of significantly more code to write and maintain.

© 2026 EngineersOfAI. All rights reserved.